Predicting Outcomes for New Data

Predicting the target values for new observations works like most other predict methods in R: in general, all you need to do is call predict on the object returned by train and pass the data you want predictions for.

There are two ways to pass the data: either pass the Task via the task argument or pass a data frame via the newdata argument.

The first way is preferable if you want predictions for data already included in the Task.

Just like train, predict has a subset argument, so you can set aside different portions of the data in the Task for training and prediction.

In the following example we fit a gradient boosting machine to every second observation of the BostonHousing data set and make predictions on the remaining data in bh.task.

n = bh.task$task.desc$size
train.set = seq(1, n, 2)
test.set = seq(2, n, 2)
lrn = makeLearner("regr.gbm", n.trees = 100)
## Loading required package: gbm
## Loading required package: survival
## Loading required package: splines
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Loaded gbm 2.1
mod = train(lrn, bh.task, subset = train.set)

task.pred = predict(mod, task = bh.task, subset = test.set)
task.pred
## Prediction:
## predict.type: response
## threshold: 
## time: 0.00
##    id truth response
## 2   2  21.6    22.29
## 4   4  33.4    23.34
## 6   6  28.7    22.41
## 8   8  27.1    22.13
## 10 10  18.9    22.13
## 12 12  18.9    22.13

The second way is useful if you want to predict data not included in the Task.

In the following we cluster the iris data set without the target variable. All observations with an odd index are included in the Task and used for training; predictions are made for the remaining observations. Note that the cluster.XMeans learner requires the Weka package XMeans, which you can install with RWeka::WPM("install-package", "XMeans").

n = nrow(iris)
iris.train = iris[seq(1, n, 2), -5]
iris.test = iris[seq(2, n, 2), -5]
task = makeClusterTask(data = iris.train)
mod = train("cluster.XMeans", task)
## Loading required package: RWeka
newdata.pred = predict(mod, newdata = iris.test)
newdata.pred

Changing the type of prediction

The result of predict depends on the nature of the Task and the type of prediction chosen when creating the Learner. For example, in survival analysis the default is to predict the response; for the Cox proportional hazards model these are the values of the linear predictor, as shown in the following.

n = lung.task$task.desc$size
train.set = seq(1, n, 2)
test.set = seq(2, n, 2)
mod = train("surv.coxph", lung.task, subset = train.set)

pred = predict(mod, task = lung.task, subset = test.set)
pred
## Prediction:
## predict.type: response
## threshold: 
## time: 0.00
##    id truth.time truth.event response
## 4   2        210        TRUE  0.57483
## 7   4        310        TRUE  0.69902
## 9   6        218        TRUE -0.26940
## 11  8        170        TRUE -0.05084
## 17 10        613        TRUE  0.26961
## 19 12         61        TRUE  0.26804

It is also possible to predict time-dependent probabilities. In order to do so you have to create a Learner and set predict.type = "prob".

lrn = makeLearner("surv.coxph", predict.type = "prob")
mod = train(lrn, lung.task, subset = train.set)

pred = predict(mod, task = lung.task, subset = test.set)
head(pred$data[,1:8])
##    id truth.time truth.event response.1 response.2 response.3 response.4
## 4   2        210        TRUE     0.9821     0.9821     0.9821     0.9821
## 7   4        310        TRUE     0.9797     0.9797     0.9797     0.9797
## 9   6        218        TRUE     0.9923     0.9923     0.9923     0.9923
## 11  8        170        TRUE     0.9904     0.9904     0.9904     0.9904
## 17 10        613        TRUE     0.9868     0.9868     0.9868     0.9868
## 19 12         61        TRUE     0.9868     0.9868     0.9868     0.9868
##    response.5
## 4      0.9821
## 7      0.9797
## 9      0.9923
## 11     0.9904
## 17     0.9868
## 19     0.9868

Predictions are encapsulated in a special Prediction object.

Accessing the prediction

A Prediction object is a list. Its most important element is "data", a data.frame with columns containing the true values of the target variable (for supervised learning problems) and the predictions.

In the following the predictions on the BostonHousing and the iris data sets are shown. As you may recall, the predictions in the first case were made from a Task and in the second case from a data.frame.

## Result of predict with data passed via task argument
head(task.pred$data)
##    id truth response
## 2   2  21.6    22.29
## 4   4  33.4    23.34
## 6   6  28.7    22.41
## 8   8  27.1    22.13
## 10 10  18.9    22.13
## 12 12  18.9    22.13
## Result of predict with data passed via newdata argument
head(newdata.pred$data)

As you can see, when predicting from a Task the resulting data.frame contains an additional column, called id, which tells you which element of the original data set a prediction corresponds to.
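To make the alignment concrete, here is a small base-R sketch (with made-up prediction values, not mlr code) showing how the id column indexes rows of the original data set:

```r
## Toy illustration (made-up values): a prediction data.frame with an id column
pred.data <- data.frame(id = c(2, 4, 6),
                        truth = c(21.6, 33.4, 28.7),
                        response = c(22.29, 23.34, 22.41))
## Stand-in for the original data set
orig <- data.frame(x = seq(10, 60, by = 10))
## Select the original rows the predictions refer to
orig[pred.data$id, "x"]
## [1] 20 40 60
```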

Extracting probabilities

The predicted probabilities can be extracted from the Prediction using the function getProbabilities. Here is another cluster analysis example: we use fuzzy c-means clustering (learner cluster.cmeans, which requires the packages e1071 and clue) on the mtcars data set.

lrn = makeLearner("cluster.cmeans", predict.type = "prob")
mod = train(lrn, mtcars.task)

pred = predict(mod, task = mtcars.task)
head(getProbabilities(pred))

In case of classification problems there are some more things worth mentioning.

Classification

By default, class labels are predicted.

## Linear discriminant analysis on the iris data set
mod = train("classif.lda", task = iris.task)
## Loading required package: MASS
pred = predict(mod, task = iris.task)
pred
## Prediction:
## predict.type: response
## threshold: 
## time: 0.01
##   id  truth response
## 1  1 setosa   setosa
## 2  2 setosa   setosa
## 3  3 setosa   setosa
## 4  4 setosa   setosa
## 5  5 setosa   setosa
## 6  6 setosa   setosa

A confusion matrix can be obtained by calling getConfMatrix.

getConfMatrix(pred)
##             predicted
## true         setosa versicolor virginica -SUM-
##   setosa         50          0         0     0
##   versicolor      0         48         2     2
##   virginica       0          1        49     1
##   -SUM-           0          1         2     3

In order to get predicted posterior probabilities we have to create a Learner with the appropriate predict.type.

lrn = makeLearner("classif.rpart", predict.type = "prob")
## Loading required package: rpart
mod = train(lrn, iris.task)

pred = predict(mod, newdata = iris)
head(pred$data)
##    truth prob.setosa prob.versicolor prob.virginica response
## 1 setosa           1               0              0   setosa
## 2 setosa           1               0              0   setosa
## 3 setosa           1               0              0   setosa
## 4 setosa           1               0              0   setosa
## 5 setosa           1               0              0   setosa
## 6 setosa           1               0              0   setosa

In addition to the probabilities, class labels are predicted by choosing the class with the maximum probability and breaking ties at random.
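This label assignment can be sketched in base R (hypothetical probability matrix; mlr's internals may differ):

```r
## Hypothetical posterior probabilities for two observations and three classes
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.1, 0.8),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("setosa", "versicolor", "virginica")))
## Pick the class with maximum probability; ties.method = "random"
## mirrors the random tie-breaking described above
labels <- colnames(probs)[max.col(probs, ties.method = "random")]
labels
## [1] "setosa"    "virginica"
```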

As mentioned above, the predicted posterior probabilities can be accessed via the getProbabilities function.

head(getProbabilities(pred))
##   setosa versicolor virginica
## 1      1          0         0
## 2      1          0         0
## 3      1          0         0
## 4      1          0         0
## 5      1          0         0
## 6      1          0         0

Adjusting the threshold in binary classification

In case of binary classification, two things are worth mentioning. As you may recall, we can specify which of the two classes should be considered the positive class when generating the Task. Moreover, we can set the threshold value used to map the predicted posterior probabilities to class labels. Note that for this purpose we need a Learner that predicts probabilities.

To illustrate binary classification, we use the BreastCancer data set from the mlbench package.

lrn = makeLearner("classif.rpart", predict.type = "prob")
mod = train(lrn, task = bc.task)

pred = predict(mod, task = bc.task)
pred
## Prediction:
## predict.type: prob
## threshold: benign=0.50,malignant=0.50
## time: 0.01
##   id     truth prob.benign prob.malignant  response
## 1  1    benign     0.98780         0.0122    benign
## 2  2    benign     0.12963         0.8704 malignant
## 3  3    benign     0.98780         0.0122    benign
## 4  4    benign     0.01724         0.9828 malignant
## 5  5    benign     0.98780         0.0122    benign
## 6  6 malignant     0.01724         0.9828 malignant
pred$threshold
##    benign malignant 
##       0.5       0.5

As you can see, the default threshold is 0.5; that is, an example is assigned to the class with the maximum posterior probability.

We can adjust the threshold using the function setThreshold. Here we set the threshold for the positive class to 0.8; that is, an example is assigned to the positive class if its posterior probability exceeds 0.8. Which of the two classes is the positive one can be seen by accessing the Task.
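Conceptually, thresholding the positive-class probability amounts to a comparison like the following base-R sketch (made-up probabilities, not mlr's implementation):

```r
threshold <- 0.8
## Hypothetical posterior probabilities of the positive class "benign"
prob.benign <- c(0.988, 0.130, 0.017)
## Assign the positive class only if its probability exceeds the threshold
response <- ifelse(prob.benign > threshold, "benign", "malignant")
response
## [1] "benign"    "malignant" "malignant"
```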

## Label of the positive class
bc.task$task.desc$positive
## [1] "benign"
## Set the threshold value for the positive class
pred = setThreshold(pred, 0.8)
pred
## Prediction:
## predict.type: prob
## threshold: benign=0.80,malignant=0.20
## time: 0.01
##   id     truth prob.benign prob.malignant  response
## 1  1    benign     0.98780         0.0122    benign
## 2  2    benign     0.12963         0.8704 malignant
## 3  3    benign     0.98780         0.0122    benign
## 4  4    benign     0.01724         0.9828 malignant
## 5  5    benign     0.98780         0.0122    benign
## 6  6 malignant     0.01724         0.9828 malignant
pred$threshold
##    benign malignant 
##       0.8       0.2

Note that in the binary case getProbabilities extracts the posterior probabilities of the positive class only.

head(getProbabilities(pred))
## [1] 0.98780 0.12963 0.98780 0.01724 0.98780 0.01724

Visualizing the prediction

Function plotLearnerPrediction allows you to visualize predictions, e.g., for teaching purposes or for exploring models. It trains the chosen learning method on 1 or 2 selected features and then displays the predictions via ggplot.

For classification, we get a scatter plot of 2 features (by default the first 2 in the data set). The plotting symbols show the true class labels of the data points, and the color indicates whether observations are misclassified. If the learner under consideration supports this, the posterior probabilities are represented by the background color.

The plot title displays the ID of the Learner (in the following example CART), its parameters, its training performance and its cross-validation performance. mmce stands for mean misclassification error, i.e., the error rate. See the sections Performance and Resampling for further explanations.

lrn = makeLearner("classif.rpart", id = "CART")
plotLearnerPrediction(lrn, task = iris.task)

[Plot: CART predictions on the first two features of the iris data]

For regression, there exist two types of plots. The 1D plot shows the target values plotted against a single feature, the regression curve, and, if the chosen learner supports this, the estimated standard error.

plotLearnerPrediction("regr.lm", features = "lstat", task = bh.task)

[Plot: target medv vs. feature lstat with the fitted regression curve]

The 2D variant generates, as in the classification case, a scatter plot of 2 features. The fill color of the dots illustrates the value of the target variable "medv", while the background color shows the estimated mean. The estimated standard error is not displayed in this plot.

plotLearnerPrediction("regr.lm", features = c("lstat", "rm"), task = bh.task)

[Plot: medv over the features lstat and rm, with the estimated mean as background color]